According to PetCareRX there are roughly 600,000 dogs and 500,000 cats in NYC!!!
About the data:
Objective:
#import the necessary libraries
import numpy as np
import pandas as pd
import datetime as dt
import seaborn as sns
import regex as re
import matplotlib.pyplot as plt
from matplotlib.cbook import get_sample_data
from matplotlib.offsetbox import (OffsetImage, AnnotationBbox)
from palettable.colorbrewer.qualitative import Pastel1_7,Dark2_7
import plotly.express as px
We load the dataset and perform an initial Exploratory Data Analysis to get a broader picture of the data. We also perform some data cleaning tasks.
data = pd.read_csv('input/NYC_Dog_Licensing_Dataset.csv') #data for original dataset
data.head(20)
| | AnimalName | AnimalGender | AnimalBirthYear | BreedName | ZipCode | LicenseIssuedDate | LicenseExpiredDate | Extract Year |
|---|---|---|---|---|---|---|---|---|
| 0 | PAIGE | F | 2014 | American Pit Bull Mix / Pit Bull Mix | 10035.0 | 09/12/2014 | 09/12/2017 | 2016 |
| 1 | YOGI | M | 2010 | Boxer | 10465.0 | 09/12/2014 | 10/02/2017 | 2016 |
| 2 | ALI | M | 2014 | Basenji | 10013.0 | 09/12/2014 | 09/12/2019 | 2016 |
| 3 | QUEEN | F | 2013 | Akita Crossbreed | 10013.0 | 09/12/2014 | 09/12/2017 | 2016 |
| 4 | LOLA | F | 2009 | Maltese | 10028.0 | 09/12/2014 | 10/09/2017 | 2016 |
| 5 | IAN | M | 2006 | Unknown | 10013.0 | 09/12/2014 | 10/30/2019 | 2016 |
| 6 | BUDDY | M | 2008 | Unknown | 10025.0 | 09/12/2014 | 10/20/2017 | 2016 |
| 7 | CHEWBACCA | F | 2012 | Labrador Retriever Crossbreed | 10013.0 | 09/12/2014 | 10/01/2019 | 2016 |
| 8 | HEIDI-BO | F | 2007 | Dachshund Smooth Coat | 11215.0 | 09/13/2014 | 04/16/2017 | 2016 |
| 9 | MASSIMO | M | 2009 | Bull Dog, French | 11201.0 | 09/13/2014 | 09/17/2017 | 2016 |
| 10 | LOLA | F | 2006 | Miniature Pinscher | 10022.0 | 09/13/2014 | 10/03/2019 | 2016 |
| 11 | LEMMY | F | 2005 | Yorkshire Terrier | 10003.0 | 09/13/2014 | 10/26/2017 | 2016 |
| 12 | LUCY | F | 2014 | Dachshund Smooth Coat Miniature | 11215.0 | 09/13/2014 | 09/13/2019 | 2016 |
| 13 | RICKY | M | 2014 | German Shepherd Dog | 11220.0 | 09/13/2014 | 09/13/2017 | 2016 |
| 14 | SARAH | F | 2012 | Unknown | 10040.0 | 09/13/2014 | 09/13/2017 | 2016 |
| 15 | MURPHY | M | 2012 | American Pit Bull Mix / Pit Bull Mix | 10463.0 | 09/13/2014 | 09/28/2017 | 2016 |
| 16 | JUNE | F | 2010 | Cavalier King Charles Spaniel | 11238.0 | 09/13/2014 | 10/28/2019 | 2016 |
| 17 | ELIZABETH | F | 2013 | Cavalier King Charles Spaniel | 10022.0 | 09/13/2014 | 09/13/2019 | 2016 |
| 18 | AVERY | F | 2014 | American Pit Bull Terrier/Pit Bull | 10002.0 | 09/13/2014 | 09/13/2019 | 2016 |
| 19 | SOPHIE | F | 2011 | Boxer | 10308.0 | 09/13/2014 | 10/23/2019 | 2016 |
| Column Name | Column Description |
|---|---|
| AnimalName | User-provided dog name (unless specified otherwise) |
| AnimalGender | M (Male) or F (Female) dog gender |
| AnimalBirthYear | Year dog was born |
| BreedName | Dog breed |
| ZipCode | Owner zip code |
| LicenseIssuedDate | Date the dog license was issued |
| LicenseExpiredDate | Date the dog license expires |
| Extract Year | Year the data was extracted |
Before making any assumptions, we would like an overview of the data. In the process, we identify obvious errors, better understand patterns within the data, detect outliers or anomalous events, and find interesting relations among the variables.
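As a rough illustration of what such an overview can look like, here is a minimal sketch on a small made-up frame. The `toy` DataFrame and its values are assumptions for illustration only, not rows from the licensing dataset.

```python
import pandas as pd

# Toy stand-in for the licensing data (values are made up for illustration)
toy = pd.DataFrame({
    "AnimalName": ["PAIGE", "YOGI", None, "LOLA"],
    "AnimalGender": ["F", "M", None, "F"],
    "AnimalBirthYear": [2014, 2010, 2013, 2009],
})

# Missing values per column
missing_per_column = toy.isna().sum()

# Number of fully duplicated rows
duplicate_rows = int(toy.duplicated().sum())

# Range of a numeric column, a quick way to spot implausible birth years
birth_year_range = (toy["AnimalBirthYear"].min(), toy["AnimalBirthYear"].max())

print(missing_per_column)
print(duplicate_rows, birth_year_range)
```

The same three checks can be run on `data` once it is loaded.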
#dimensions of the dataframe
data.shape
(508196, 8)
#change all column headers to lowercase
data.columns = data.columns.str.lower()
Looks good!
#exploring the animal gender column
data.animalgender.unique()
array(['F', 'M', nan], dtype=object)
The column holds three distinct values: 'F', 'M' and nan. Missing entries (nan, or a blank ' ') are of no use to us, so for analysis purposes we get rid of those rows.
#replace empty values with N/A to delete them later
data.animalgender = data.animalgender.replace({' ':"N/A"}).fillna("N/A")
#count number of unknown rows
len(data[data["animalgender"] == "N/A"])
21
#remove the unknown rows
data = data.loc[(data["animalgender"] != "N/A")]
data.animalgender.unique()
array(['F', 'M'], dtype=object)
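The same cleanup can also be done without a placeholder value. The sketch below is one possible alternative on a made-up `toy` frame (an assumption for illustration): it builds a boolean mask that rejects both missing and blank genders, then filters once.

```python
import numpy as np
import pandas as pd

# Toy column with a missing value and a blank string (made-up data)
toy = pd.DataFrame({"animalgender": ["F", "M", np.nan, " ", "F"]})

# Strip whitespace so that ' ' becomes an empty string; NaN stays NaN
gender = toy["animalgender"].str.strip()

# Keep rows whose gender is present and non-empty
mask = gender.notna() & (gender != "")
cleaned = toy.loc[mask]

# Number of rows dropped by the filter
n_removed = len(toy) - len(cleaned)
print(cleaned["animalgender"].unique(), n_removed)
```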
#total unique values in zipcode column
data.zipcode.nunique()
784
zipcodes = pd.read_csv('input/NYC_Borough_Zipcodes.csv') #NYC Borough Zipcodes
#keep only rows whose zipcodes belong to an NYC borough
data = data.merge(zipcodes, on='zipcode', how='inner')
data.shape
(490082, 11)
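The inner merge silently drops every license whose zip code is not in the lookup table. To see which rows are lost, one option is a left merge with `indicator=True`. The sketch below uses made-up frames; `licenses`, `lookup` and their values are assumptions and not the real files.

```python
import pandas as pd

# Made-up licenses, including one out-of-city zip code and one missing zip code
licenses = pd.DataFrame({
    "zipcode": [10035.0, 10465.0, 7030.0, None],
    "animalname": ["PAIGE", "YOGI", "REX", "MAX"],
})

# Made-up lookup table from zip code to borough
lookup = pd.DataFrame({
    "zipcode": [10035.0, 10465.0],
    "borough": ["Manhattan", "Bronx"],
})

# A left merge keeps every license; _merge records whether a borough was found
checked = licenses.merge(lookup, on="zipcode", how="left", indicator=True)

# Rows an inner merge would have dropped
unmatched = checked[checked["_merge"] == "left_only"]

# Rows an inner merge would have kept
matched = checked[checked["_merge"] == "both"].drop(columns="_merge")

print(len(unmatched), len(matched))
```

Inspecting `unmatched` before discarding it helps separate genuine out-of-city owners from typos in the zip code.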
#convert all to lower case
# data.borough = data.borough.str.lower()
#remove leading and trailing spaces
data.borough = data.borough.replace(r"^ +| +$", r"", regex=True)
data.borough.unique()
array(['Manhattan', 'Bronx', 'Brooklyn', 'Staten Island', 'Queens'],
dtype=object)
#create and plot a cross table for gender vs borough
crosstb = pd.crosstab(data.borough,data.animalgender)
crosstb
| animalgender | F | M |
|---|---|---|
| borough | ||
| Bronx | 22865 | 29671 |
| Brooklyn | 61429 | 73070 |
| Manhattan | 73955 | 84087 |
| Queens | 43895 | 56637 |
| Staten Island | 20553 | 23920 |
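Raw counts make the boroughs hard to compare because their totals differ a lot. A minimal sketch of turning the cross table into per-borough shares follows; the counts are copied from the table above, and the frame is rebuilt by hand so the snippet runs on its own.

```python
import pandas as pd

# Counts copied from the cross table above
crosstb = pd.DataFrame(
    {"F": [22865, 61429, 73955, 43895, 20553],
     "M": [29671, 73070, 84087, 56637, 23920]},
    index=pd.Index(
        ["Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"],
        name="borough",
    ),
)

# Total number of dogs per borough
borough_totals = crosstb.sum(axis=1)

# Divide each row by its total to get the gender share within each borough
share = crosstb.div(borough_totals, axis=0).round(3)
print(share)
```

On the full data, `pd.crosstab(data.borough, data.animalgender, normalize='index')` should give the same shares directly.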
#1. donut chart Doggos in NYC Boroughs
my_circle = plt.Circle( (0,0), 0.7, color='white')
#number of doggos in nyc boroughs (groupby returns the boroughs in alphabetical order)
borough_counts = data.groupby(['borough'])['borough'].count()
borough_groupby_count = pd.Series(borough_counts.values)
#take the labels from the groupby index so that each label matches its count
nyc_boroughs = pd.Series(borough_counts.index.values)
boroughs_count=pd.concat([nyc_boroughs,borough_groupby_count],axis=1)
#plot
plt.figure(figsize=(6,6))
plt.pie(boroughs_count.loc[:,1], labels=boroughs_count.loc[:,0]+ ': '+boroughs_count.loc[:,1].astype(str), colors=Pastel1_7.hex_colors)
p = plt.gcf()
p.gca().add_artist(my_circle)
#add text
plt.text(1.5,0, 'Manhattan has the most Doggos in NYC', fontsize = 22, bbox = dict(facecolor = 'red', alpha = 0.5))
#string at the center of the donut
sumstr = str(round(np.sum(borough_groupby_count)/1000,2)) + 'K'
plt.text(0., 0., sumstr, horizontalalignment='center', verticalalignment='center',fontsize = 32)
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
#2. barchart for Genders in NYC Boroughs
plt.figure(figsize=(15,8))
#stack and reset
stacked = crosstb.stack().reset_index().rename(columns={0:'value'})
# plot grouped bar chart
p = sns.barplot(x=stacked.borough, y=stacked.value, hue=stacked.animalgender, palette = 'PiYG')
#control aesthetics
#change theme - gridlines, axes & size
sns.set_theme()
sns.set_context("talk",font_scale = .8)
#adjust legend and axis labels
#set the title while moving the legend; a separate plt.legend() call would rebuild the legend and undo the move
sns.move_legend(p, bbox_to_anchor=(1, 1.02), loc='upper left', title="Gender")
p.set(xlabel="Boroughs",ylabel="Number of Doggos",title="Dog Genders in NYC")
# Remove borders
p.spines['top'].set_visible(False)
p.spines['right'].set_visible(False)
p.spines['bottom'].set_visible(False)
p.spines['left'].set_visible(False)
We would like to find the most popular Doggo breed in NYC. Either of the two visualizations that follow, a bar chart and a lollipop chart, can be used to answer this question.
#1. bar chart
#get the count of each breed name, sorted in descending order
breeds_count = (data.groupby('breedname').size()
                .sort_values(ascending=False)
                .reset_index(name='count'))
#label the ones that have count less than 100
breeds_count.loc[breeds_count['count'] < 100, 'breedname'] = 'Other'
breeds_count = breeds_count.groupby(['breedname'],as_index =False).sum().sort_values(by='count',ascending=False)
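To make the "Other" bucketing easier to follow, here is a small sketch of the same idea on made-up data. The `breeds` Series is an assumption, and the threshold is lowered from 100 to 2 so that the toy data shows the effect.

```python
import pandas as pd

# Made-up breed column
breeds = pd.Series(
    ["Boxer", "Boxer", "Boxer", "Maltese", "Maltese", "Basenji", "Akita"]
)
threshold = 2  # the notebook uses 100; lowered here for the toy data

# Count each breed and collect the ones below the threshold
counts = breeds.value_counts()
rare = counts[counts < threshold].index

# Relabel rare breeds as 'Other', then count again
collapsed = breeds.where(~breeds.isin(rare), "Other").value_counts()
print(collapsed)
```

Note that "Other" can end up near the top of the chart even though no single breed inside it is common.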
plt.figure(figsize=(15,10))
p2 = sns.barplot(y="breedname", x="count", data = breeds_count.head(20),palette='Pastel2')
#control aesthetics
#change theme - gridlines, axes & size
sns.set_theme(style="whitegrid")
sns.set_context("talk",font_scale = .8)
#adjust legend and axis labels
p2.set(xlabel="Number of Doggos",ylabel="Breed Name",title="Most Popular Doggo Breed in NYC",)
# Remove borders
p2.spines[['top','bottom','left','right']].set_visible(False)
#horizontal lollipop chart
# Reorder it based on the values
# ordered_df = df.sort_values(by='values')
my_range=range(1,len(breeds_count.head(20).index)+1)
plt.figure(figsize=(15,10))
# The horizontal plot is made using the hline function
plt.hlines(y=my_range, xmin=0, xmax=breeds_count['count'].head(20), colors=Dark2_7.hex_colors)
plt.plot(breeds_count['count'].head(20), my_range, "o",color='#D3D3D3')
# Add titles and axis names
# plt.yticks(0, breeds_count['breedname'].head(20))
plt.yticks(my_range, breeds_count['breedname'].head(20))
plt.title("Most Popular Doggo Breed in NYC", loc='left')
plt.xlabel('Number of Doggos')
# plt.ylabel('Breed Name')
plt.grid(axis='y')
# plt.axis('off')
# Show the plot
plt.show()
Yorkshire Terrier, Shih Tzu and Chihuahua are some of the most popular breeds.
We can tell from the visualizations that there are a lot of Doggos whose breeds are not determined. We might want to invest in collecting more complete breed data to make the results more accurate. Improving the business processes that have led us to this data can be an efficient way to streamline the whole process.
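One way to put a number on "a lot" is the share of rows labelled "Unknown". The sketch below uses a made-up `toy` frame (an assumption), so the printed share says nothing about the real data.

```python
import pandas as pd

# Made-up breed column
toy = pd.DataFrame(
    {"breedname": ["Unknown", "Boxer", "Unknown", "Maltese", "Basenji"]}
)

# Normalize case and whitespace before comparing, then take the mean of the boolean mask
is_unknown = toy["breedname"].str.strip().str.lower() == "unknown"
unknown_share = is_unknown.mean()
print(f"{unknown_share:.1%} of rows have an undetermined breed")
```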
We would like to find the density of doggos across all of the NYC boroughs. We will use a map chart to visualize it.
#store zipcodes, borough and count in data frame
zipcode_mapping = pd.concat([pd.Series(data.groupby(['zipcode'])['zipcode'].count().index),
pd.Series(data.groupby(['zipcode'])['zipcode'].count().values)],axis=1)
#note: data has one row per license, so this merge repeats each zipcode's count once per license
zipcode_mapping = zipcode_mapping.merge(data[['zipcode','borough']],on='zipcode',how='inner')
#rename columns
zipcode_mapping = zipcode_mapping.rename(columns={'zipcode':'ZIPCODE',0:'Count','borough':'Borough'})
#change data type to match it with data type of ZIPCODE in GeoJSON file
zipcode_mapping = zipcode_mapping.astype({"ZIPCODE":int, "Count":int})
zipcode_mapping = zipcode_mapping.astype({'ZIPCODE' : 'string'})
zipcode_mapping.tail()
| | ZIPCODE | Count | Borough |
|---|---|---|---|
| 490077 | 11697 | 225 | Queens |
| 490078 | 11697 | 225 | Queens |
| 490079 | 11697 | 225 | Queens |
| 490080 | 11697 | 225 | Queens |
| 490081 | 11697 | 225 | Queens |
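The tail above shows the same zip code repeated on many rows, because the frame still has one row per license. A choropleth only needs one row per zip code. The sketch below shows one possible way to build that on made-up data; the `toy` frame is an assumption, and it also assumes that each zip code maps to a single borough.

```python
import pandas as pd

# Made-up licenses: two in 10035 and three in 11697
toy = pd.DataFrame({
    "zipcode": [10035.0, 10035.0, 11697.0, 11697.0, 11697.0],
    "borough": ["Manhattan", "Manhattan", "Queens", "Queens", "Queens"],
})

# One row per (zipcode, borough) pair with the number of licenses
per_zip = (
    toy.groupby(["zipcode", "borough"])
       .size()
       .reset_index(name="Count")
       .rename(columns={"zipcode": "ZIPCODE", "borough": "Borough"})
)

# float -> int -> str, so 10035.0 becomes '10035' and can match string ids in a GeoJSON file
per_zip["ZIPCODE"] = per_zip["ZIPCODE"].astype(int).astype(str)
print(per_zip)
```

A frame of this shape is much smaller than the per-license frame, which also keeps the resulting figure lighter.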
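#plot map using plotly express choropleth mapbox
#load the GeoJSON file into a dict, the form px.choropleth_mapbox documents for its geojson argument
import json
with open("input/zip_code_040114.geojson") as f:
    nyc_zip_geojson = json.load(f)
fig = px.choropleth_mapbox(zipcode_mapping, geojson=nyc_zip_geojson,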
locations='ZIPCODE',
color='Count',
color_continuous_scale="Pinkyl",
range_color=(0, 10000),
mapbox_style="carto-positron",
opacity=0.5,
featureidkey="properties.ZIPCODE",
zoom=9.25, center = {"lat": 40.7128, "lon": -74.0060},
hover_name=zipcode_mapping['Borough'],
hover_data=['Count'],
title="Active Doggo Licenses"
)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
# fig.update_geos(fitbounds="locations", visible=True,scope="usa",showsubunits=True,subunitcolor="Black")
fig.show()
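Zip codes that have no matching feature in the GeoJSON are simply not drawn, so it can be worth comparing the two sets of ids before plotting. The sketch below uses a hand-written stand-in for the GeoJSON and a made-up set of zip codes; both are assumptions. The `properties.ZIPCODE` path mirrors the `featureidkey` used above.

```python
# Hand-written stand-in for the GeoJSON file (geometry omitted for brevity)
toy_geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"ZIPCODE": "10035"}, "geometry": None},
        {"type": "Feature", "properties": {"ZIPCODE": "11697"}, "geometry": None},
    ],
}

# Made-up zip codes from the data side, already converted to strings
data_zips = {"10035", "11697", "99999"}

# Collect the ids that featureidkey="properties.ZIPCODE" would look up
geo_zips = {feature["properties"]["ZIPCODE"] for feature in toy_geojson["features"]}

# Zip codes present in the data but absent from the map
missing_from_map = sorted(data_zips - geo_zips)
print(missing_from_map)
```

With the real inputs, `data_zips` would come from `zipcode_mapping['ZIPCODE']` and `toy_geojson` from the loaded GeoJSON file.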